KIA Konstanz Intelligence Agency, University of Konstanz - KIAWordCloudVis

VAST 2011 Challenge
Mini-Challenge 3 - Investigation into Terrorist Activity

Authors and Affiliations:

Michael Regenscheit, University of Konstanz, michael.regenscheit@uni-konstanz.de
Christian Scheible, University of Konstanz, christian.scheible@uni-konstanz.de
Thomas Ramm, University of Konstanz, thomas.ramm@uni-konstanz.de

 

Tool(s):

KNIME: Konstanz Information Miner (http://www.knime.org/) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform.

Jigsaw: (http://www.cc.gatech.edu/gvu/ii/jigsaw/index.html) is a visual analytics system that enables analysts and researchers to explore, analyze, and make sense of document collections.

 

KIAWordCloudVis: The Konstanz Intelligence Agency Word Cloud Visualization is a full text search and visualization tool providing a fast overview of a document collection. We developed it for the VAST Challenge 2011, making use of Apache Lucene for indexing and the IBM Word-Cloud Generator for the word clouds.

 

Apache Lucene: (http://lucene.apache.org/java/docs/index.html) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, in particular for cross-platform applications.

 

Word-Cloud Generator: (http://www.alphaworks.ibm.com/tech/wordcloud) is a Java application that creates word clouds from any source text. It's built on the same technology that powers the popular "Wordle" web application.

 

NER: Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are proper names, such as person and company names, or gene and protein names.

 

Mallet: MAchine Learning for LanguagE Toolkit (http://mallet.cs.umass.edu/topics.php). Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. (We used this tool, but none of the resulting topics fit our task, so it is not mentioned in the text.)

 

 

Video:

 

 KIA_MC_III_UNI_KN.mov

 

ANSWERS:


MC 3.1 Potential Threats: Identify any imminent terrorist threats in the Vastopolis metropolitan area. Provide detailed information on the threat or threats (e.g. who, what, where, when, and how) so that officials can conduct counterintelligence activities. Also, provide a list of the evidential documents supporting your answer.

The Analytic Process

 

To find the relevant documents we used a combination of different tools and methods, as shown in Figure 1. As a first step we retrieved a list of terrorism keywords (http://www.myvocabulary.com/index.php?dir=wordlist&file=word_list&wordlist_id=197) and manually reviewed and adapted it, which took about 20 minutes. We used this list to sort the articles by relevance (the number of terror words occurring in the text) and scanned the top results manually to train two classifiers in KNIME. We joined the results of these two classifiers and added further texts containing at least 3 terror words. The result was a subset of 826 of the 4474 documents, containing texts that are very likely to report on terrorist activities. All these steps together took us approximately nine man-hours.
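The keyword-based ranking and the "at least 3 terror words" rule described above can be sketched as follows. This is a minimal illustration, not our KNIME workflow: the word set `TERROR_WORDS`, the function names, and the choice to count every occurrence of a keyword (rather than distinct keywords) are assumptions made for the example, and the classifier output is represented only as a given set of document ids.

```python
import re

# Hypothetical stand-in for the manually adapted terror word list.
TERROR_WORDS = {"bomb", "attack", "threat", "explosive", "hostage", "terror"}

def terror_word_count(text, keywords=TERROR_WORDS):
    """Count the tokens of `text` that appear in the keyword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in keywords)

def rank_by_relevance(docs, keywords=TERROR_WORDS):
    """Return (doc_id, count) pairs, highest keyword count first."""
    scored = [(doc_id, terror_word_count(text, keywords))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

def select_candidates(docs, classifier_hits, min_hits=3, keywords=TERROR_WORDS):
    """Union of classifier-positive ids and ids with >= min_hits keyword hits."""
    by_threshold = {doc_id for doc_id, text in docs.items()
                    if terror_word_count(text, keywords) >= min_hits}
    return set(classifier_hits) | by_threshold
```

Here `docs` is assumed to be a dictionary mapping document ids to article texts, and `classifier_hits` the ids flagged by either classifier.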

Figure 1: This diagram shows all steps of the analytic process. Blue indicates automatic preprocessing steps, red indicates manual steps, and orange indicates interactive visual analyses based on different tools.

 

To support an interactive exploration of the documents we developed KIAWordCloudVis, a full text search and visualization tool (Figure 2 and Figure 3), in four man-days. With the help of this tool we performed an iterative search on the whole corpus at a powerwall (a very large, high-resolution screen) to detect interesting documents we had not yet discovered (about 15 man-hours). By combining the strengths of Lucene, the Stanford NER, a Porter stemmer and IBM's Word-Cloud Generator, the tool acts as a full text search engine that makes it possible to identify highly relevant texts at a glance. It gave us entry points for deeper investigations with KIAWordCloudVis as well as with Jigsaw (Figure 3). In the example below, the interactive visual search with the highlighted terror words (BOMB, SCARE) enabled us to discover instantly a text (274) about a bomb threat in Vastopolis that we had not been aware of before. We show next how we used this particular result for further investigations with Jigsaw (about 15 man-hours).

 

 

 

Figure 2: Screenshot of KIAWordCloudVis with the simple Boolean search string bomb AND vastopolis. The marked text (274) is one of the evidence texts we used for further investigation in Jigsaw. In the bottom right corner you can see a more detailed version of the same text, shown as a text summarization word cloud.
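To illustrate what a Boolean query such as bomb AND vastopolis does conceptually, here is a toy inverted index in Python. It is only a sketch under stated assumptions: KIAWordCloudVis relies on Apache Lucene, which additionally handles stemming, ranking and a much richer query syntax, and the names `tokenize`, `build_index` and `search_and` are hypothetical helpers invented for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Evaluate a query of the form 'term1 AND term2 AND ...'."""
    terms = [term.strip().lower() for term in query.split(" AND ")]
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))
```

A document is returned only if every query term occurs in it, which is why the conjunction narrows a large corpus down to a handful of candidate texts.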

 

Figure 3: Screenshot of KIAWordCloudVis. On the upper left side you can see word clouds representing days; on the upper right side there is a more detailed version of such a day cloud. In the lower half there are word clouds for single articles, again with the detailed version on the right side. At the very top there is the search field with a complex search string. The selected text, which is in our result set, is about intercepted communication from the Network of Dread.
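The per-day word clouds can be thought of as word frequency tables grouped by publication date. The following sketch shows one way to compute such tables; it is an assumption-laden illustration rather than the tool's implementation. The stop word list, the function name `day_clouds` and the date strings in the usage example are all hypothetical, and the actual layout and rendering are done by the IBM Word-Cloud Generator.

```python
import re
from collections import Counter, defaultdict

# Hypothetical minimal stop word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "was", "is", "on"}

def day_clouds(articles, top_n=3):
    """Return {day: [(word, count), ...]} with the top_n words per day.

    `articles` is an iterable of (day, text) pairs.
    """
    counts = defaultdict(Counter)
    for day, text in articles:
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOPWORDS]
        counts[day].update(words)
    return {day: counter.most_common(top_n) for day, counter in counts.items()}
```

For example, `day_clouds([("day 1", "Bomb scare: bomb squad called")])` yields a table in which "bomb" has the highest count for that day, so it would be drawn largest in the corresponding cloud.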

 

Jigsaw offers different possibilities to analyze a corpus. It contains clustering algorithms as well as entity recognition methods and visualizations such as parallel coordinates, graphs and scatterplots. We used Stanford NER for entity extraction instead of Jigsaw’s method in order to get results consistent with KIAWordCloudVis and to improve the entity recognition. To improve the clustering results, we also reduced the number of documents loaded into Jigsaw using the methods described in the first paragraph (the candidate documents instead of the whole document space). With these steps we got two clusters containing almost exclusively relevant texts. We also got a cluster about the Antarctica Airline crash, which contains many texts that mention terror but are not important for our task.

 

 

Figure 4: Screenshot showing Jigsaw's cluster and document views. We found the cluster with the highlighted text on the left by selecting the relevant document we found with KIAWordCloudVis (274) in the List View on the right. As you can see, the cluster contains 8 of 23 evidence documents (highlighted in the list on the left).

 

Findings

 

Performing the outlined analytical steps led us to several findings (Figure 5) pointing to potential threats. Our first finding is an attack with a dirty bomb by the Network of Dread, an overseas terror group. Several hints point to this attack, starting with intercepted communication indicating attacks across the country. One day later, threatening emails were sent to VastPress. The group might have tried to obtain radioactive material by ship, because radioactive cargo was found at Vastopolis harbor. Several days later a plot to detonate a dirty bomb in an American city was revealed. This suggests that the threat is serious.

 

The second scenario is about bioterrorism, in which two different terror groups are involved. The first hint is that the molecular biologist Prof. Patino gave a talk on new dangers of bioterrorism, saying that it is much easier today “to engineer dangerous microbes with the right equipment”. On the 18th of April the CDC published an article saying that food poisoning would be an easy way to spread a disease. Some biological equipment that could be used to cultivate bacteria was also stolen. This may be the equipment that was found when members of PoC who were building a laboratory were arrested. But this does not rule out the possibility that they are going to contaminate food, because on May 15th two members of PoC were trespassing near a loading dock of a food preparation plant. So it is likely that they are planning an attack on food.

 

The other group that might carry out an attack is the Citizens for the Ethical Treatment of Lab Mice. But this scenario is unlikely because the evidence we found sounds rather harmless (including incidents like trashing Prof. Patino’s garage and screaming at his neighbors).

 

Our third scenario summarizes miscellaneous attacks in Vastopolis. Weapons were stolen from the armed forces, and some of them were found in the car of a Network of Hate member. So it is possible that this group planned an assault with military weapons. Another group, called the Psycho Brotherhood, is trying to build bombs and might want to detonate them in Vastopolis.

 

                                                                                                                          

Figure 5: Timeline of events showing three possible threats for Vastopolis. The first is an attack with a (nuclear) dirty bomb, the second a bioterrorist assault, and the last combines miscellaneous threats.

 

Evidential Documents

 

8

62

129

274

383

499

1088

1671

1691

1785

1878

2287

2395

3040

3212

3229

3231

3232

3435

3563

4080